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Abstract 

In this paper we introduce a new class of multivariate unimodal distributions, 
motivated by Khintchine’s representation. We start by proposing a univariate 
model, whose support covers all the unimodal distributions on the real line. The 
proposed class of unimodal distributions can be naturally extended to higher di¬ 
mensions, by using the multivariate Gaussian copula. Under both univariate and 
multivariate settings, we provide MCMC algorithms to perform inference about 
the model parameters and predictive densities. The methodology is illustrated with 
univariate and bivariate examples, and with variables taken from a real data-set. 

Keywords : unimodal distribution, multivariate unimodality, mixture models, nonpara- 
metric Bayesian inference. 

1 Introduction 

The clustering of data into groups is one of the current major topics under research. It 
is fundamental in many statistical problems; most notably in machine learning, see, for 
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example, l29l . 

The traditional approach is to model the data via a mixture model, see HQ for fi¬ 
nite mixture distributions, and, more recently, DU. The idea, which carries through to 
infinite mixture models, is that each component in the mixture is represented by a para¬ 
metric density function, typically from the same family, which is usually the normal 
distribution. This assumption, often made for convenience, supposes each cluster or 
group can be adequately modeled via a normal distribution. For if a group requires two 
such normals, then it is deemed there are in fact two groups. Yet two normals could be 
needed even if there is one group and it happens to be skewed, for example. 

On the other hand, if we start by thinking about what a cluster could be represented 
by in terms of probability density functions, then a unimodal density is the most obvi¬ 
ous choice. Quite simply, with lack of further knowledge, i.e. only observing the data, 
a bimodal density would indicate two clusters. This was the motivation behind ll25l . 
which relies heavily on the unimodal density models being defined on the real line, for 
which there are no obvious extensions to the multivariate setting. 

There are representations of unimodal density functions on the real line (e.g. m 
and [TO]) and it is not difficult to model such a density adequately. See, for example, 
0.123 and l25l . Clearly, the aim is to provide large classes of unimodal densities and 
hence infinite dimensional models are often used. 

However, while there is an abundance of literature on unimodal density function 
modeling on the real line, there is a noticeable lack of work in two and higher dimen¬ 
sions. The reasons are quite straightforward; first there is a lack of a representation for 
unimodal distributions outside of the real line and, second, the current approaches to 
modeling unimodal densities on the real line do not naturally extend to higher dimen¬ 
sions. 

The aim in this paper is to introduce and demonstrate a class of unimodal densities 
for a multivariate setting. While marginal density functions will be modeled using the 
fl8l representation on the real line, the dependence structure will be developed using 
a Gaussian copula model. To elaborate, using an alternative form of representation of 
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unimodal densities on the real line, we can write a unimodal density as 



x. 


( 1 ) 


This then naturally extends to a bivariate setting with marginals of the form 0 via the 
use of a copula density function c(xi, X 2 ); 



Using a Gaussian copula, which has the ability to model pairwise dependence, we can 
then easily proceed upwards to the general multivariate unimodal density. 

The layout of the paper is as follows: In section 2 we provide the key representation 
of unimodal densities on the real line which we adapt in order to define continuous den¬ 
sities. Otherwise there is an issue of continuity at the mode. With a full representation 
of unimodal densities, we need a novel MCMC algorithm, particularly, and perhaps 
strangely, to include a non-zero mode. Section 3 then deals with the multivariate set¬ 
ting. There are peculiarities to the higher dimension case with the MCMC and again it 
is the location of the mode parameter which needs special attention. Section 4 provides 
a real data analysis. 

2 Models for unimodal densities 

Our aim in this paper is to start with the representation of lfl8l for unimodal densities 
on the real line and extend it to higher dimensions in a natural way, using the mul¬ 
tivariate Gaussian copula. The representation of ED is given by Y = X Z, where 
X is uniform on [0,1] and Z , independent of X, has any distribution. Then Y has a 
unimodal distribution with mode at 0. To see this let us consider the situation of y > 0. 


So 



and hence 
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where g is used to represent the density function of Z. To see more clearly the uni¬ 
modality at 0 , let us use the transform s = y/x, so 

POO 

f(y)= s _ 1 g(s)s, ( 2 ) 

Jy 

which is maximized at y = 0 . 

This representation has been widely used and is usually presented as a mixture of 
uniform distribution; i.e. 

f(y) = J s_ 1 i(o <y <s) G(s). 

The prior for G, in a Bayesian nonparametric setting, is usually taken as a Dirichlet 
process (HD). To denote this we write G ~ D(M,Gq), where M > 0 is a scale 
parameter, and Go a distribution function. In particular, E(G) = Gq. 

Now 10 and lf23l . among others, have worked with this specification and extended 
to the real line with arbitrary mode by using U(y\n — s, k + s) for some k > 0. This 
specification leads to a symmetric density. 

Furthermore, |25l develop the model to allow for asymmetry by adopting the idea 
of HU: incorporating an asymmetry parameter A in the uniform kernel, 

f(y\X,n,G) = J U (y\K- se~ x ,n + se~ x ) G(s). (3) 

So ([3]) defines a unimodal density determined by (A, k. G), where k is the location 
parameter, A defines the asymmetry, and the distribution G defines characteristics such 
as variance, kurtosis, tails and higher moments. With this representation, the support 
of the model proposed by lf25l includes all symmetric unimodal densities and a large 
class of asymmetric ones. 

However, when extending to the real line, and using a natural density for g such as 
the normal, we encounter the situation of /(0) = 00 . In fact, on the real line, g needs 
to be somewhat unusual to ensure that /(0) 7 ^ 00 . See the form of density in (|2j. 

To resolve this problem, we instead work with Y = X/Z; so again looking at 
y > 0 , we have 

P(Y < y) = [ P(Z > x/y) x. 

Jo 
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Hence, 


f(y)= [ ( x/y 2 )g{x/y)x 
Jo 


and using the transform s = x/y we obtain 


ri/y 

f(y)= sg(s) s. 
Jo 


( 4 ) 


(5) 


Thus we see the mode is again at y = 0 and /(0) will be finite subject to the more 
reasonable assumption that sg(s) is integrable at 0, rather than the more unreasonable 
assumption that s~ 1 g(s ) is integrable at 0 in the former setting. 


2.1 A class of unimodal densities 

Here we further investigate the choice of Z being a normal distribution with the repre¬ 
sentation Y = X/Z. We can easily extend (|5]) to the whole real line; resulting in 

"1 /v /-O 


r>-/y r u 

f(y) = l(y > 0) / s g(s) s + l(y < 0) / |s|s(s)s. 

JO Jl/v 


(6) 


10 Jl/y 

For a mode at n ^ 0 we simply exchange y for y - n. If the mode is at 0, then for 
linij/4.0 f(y) = lim yt o f(y), we need 

r 0 


/*oo />U 

/ sg(s) s= / \s\g(s)s. 

Jo J oo 


Our aim is to take g as a large class of density functions on the real line and the full 
class is given by mixtures of normal distributions. Thus, it is expedient to compute |6| 
for g(s) normal with mean // and variance a 2 . That is, for £ > 0 

/(£) = [ sN{s\^,a 2 ), 

Jo 

which, after some algebra, results in 

m = n *i i-$( — ) +<7 


-y 




where <j> and <f> are the pdf and cdf, respectively, for the standard normal distribution. 
With £ < 0 we simply need to rearrange the signs, so 

-mM , L (-y\ <(£-y 


/(0 = 


y 


$ [ ^ H | _ $ 


+ a 


(7) 
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Hence, this /(£) holds for all £ £ (—00, +00). 

To move the mode to re, rather than 0, write f(y\n, <7, re) = /(l/(y — re)) with /(•) 
given by <0 We could also maintain the representation (|4j in which case we can write 

2^ 


= —^= {x/{y-n)} 2 expj-i ^ 


x/{y - re) - n 


The usefulness of this is that it provides a latent joint density function which will be 
helpful when it comes to model estimation, namely. 


f(y,x\n,cr, re) oc {x/(y 


re )} 2 


exp 



x/{y - «) - 


( 8 ) 


For the forthcoming mixture model, where we consider mixtures of normals for Z, we 
find it convenient for identifiability concerns, to use the parametrization (//.. c), where 

c = Wo-) 2 - 


2.1.1 The mixture model 


The model proposed for the class of unimodal density functions on 1R is a mixture of 
the unimodal densities with mode at re proposed in section 2.1 i.e. 0- This model can 
we written as 


f(y) = '^2‘w j f{y\n j ,c,K). (9) 

3 =1 

The mode is therefore at re and we allow parameter c, which is a transformation of 
the coefficient of variation, to remain fixed across the normal components. This is 
analogous to keeping the variance terms fixed, for identifiability concerns, in a mixture 
of normal model. The support of the model is not diminished by doing this. 

Therefore, we can write 


f(y) = j f(y\ c, re) G(/t) (10) 

where G is a discrete distribution with mass Wj at the point //,. Hence, in a Bayesian 
context, the parameters to be estimated and first assigned prior distributions are 
c, re), with the constraints being that the weights sum to 1, c > 0, and the 
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[jij) are distinct. As an illustration. Figure [T| shows the density function of a mixture 
of two components with k = 10, c = 1, y, = (—5, 20) and w = (0.8, 0.2). 


The claim is that model 10 can be arbitrarily close to any unimodal density on the 
real line. The key is writing y = X/Z rather than y = XZ, and with the former we 
can employ the full support abilities of a mixture of normals without the problem of a 
discontinuity at the mode. 


2.2 Prior specifications and model estimation 

2.2.1 Prior specifications 

To complete the model described in the previous section, we now set the prior distribu¬ 
tions for the unknown model parameters. Specific values are assigned in the illustration 
section. 

• The prior distribution for the parameters (pj) are normal with zero mean and 
variance crj); while the prior for k will also be normal with zero mean and vari¬ 
ance cr^. The prior for c will be gamma with parameters (a c , /3 C ). 

• The prior distribution for the weights (wj) has a stick-breaking construction 
(123), given by 

Wi = Vi and Wj = Vj — Vi) for j > 1, 

Kj 

with the (vj ) being i.i.d. from a beta distribution with parameters 1 and M. We 
assume a gamma, ga(aM, Pm), as the hyper-prior for the parameter M. 

As a consequence, the prior for G in (jT0|» is a Dirichlet process with mean distribu¬ 
tion N( 0, <rj)) and scale parameter M. 

Before detailing the MCMC algorithm for this model, we will now introduce the 
key latent variables which facilitate an “as easy as possible” implementable algorithm. 
First, to deal with the infinite sum of elements in |9]), we introduce a variable u, such 
that we have the joint density with observation y, as 

OO 

f(y,u) = J2 1 ^ U< ^}( w f/^)/(y \H’ c >k) 

j =i 
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for some decreasing deterministic sequence (£j). See lfl 6 l for details and choices of 
(£j). In particular, we use ^ = exp(— 7 j), with 7 = 0.01, but other sequences can be 
used. 

The key here is that given u, the number of components is finite, and correspond to 
the indices A u = {j : > u}, such that 

f{y M = f(y\vj’ c i K )- 

Next we introduce the component indicator cl for each observation, which tells us from 
which component available the observation came from, resulting in the joint density 
with (y, u ) as 

f(y, u,d) = l(u< £ d ) {w d /id) f{y\Hd, c, k). 

The joint likelihood function, including latent variables, is then given by 

n 

n 1 2 't Ui < } (W di /£ di )f{Vi | ld di ,C,K), 

i=1 

and it will be convenient to define D = max{<■/, }. 

2.2.2 The MCMC algorithm 

It turns out that the most complicated parameter to deal with is k; surprisingly, since 
often the location parameter is one of the easiest to sample in a MCMC routine. Thus, 
we will initially describe the algorithm assuming k is known. The unknown quan¬ 
tities that need to be sampled from, under that scenario, are: = 1 , 2,.. .; 

(di, Ui), i = 1,..., n; c and M. The sampling of these parameters and latent variables 
is as follows, obviously conditional on other parameters and variables being known in 
each case: 

1. Sample m from the uniform distribution on (0,^), for i = 1 ,n. A useful 
summary to record here is N = max{£ - 1 (di)}. 

2. Sample from M following the approach in (3). Now M only depends on the 
other model parameters through the number of distinct (di) and the sampling 
goes as follows, 
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• sample v ~ B{M , n), where M is the current value for M. 

• sample M from ga(ctM + k, Pm — log v), 

where k is the number of distinct (d,). 

3. Sample (/j ;/ ) for j = 1,..., N via Metropolis-Hastings (M-H). If there is at least 
one d, = j then we use the M-H step, otherwise the Hj comes from the prior. 
The proposal /j* is taken to be normal, centered at the current value, i.e. 

d*j = H + N( 0, M- 

The M-H acceptance criterion is given by: 

_ II {di =j } f{yi\dpc^)Tt(n*) 

q n {d i= j} c, k)tt(H j) ’ 

where tt represents the prior distribution. We accept the proposal with probability 
min{l, q}. 

4. Sample c via Metropolis-Hastings. The proposal c* is obtained from: 

log(c*) = log(c) + N( 0, h c ). 

The M-H acceptance criterion is given by: 

= mil fiVilMdi, c*,k)tt(c*)Q(c*;c) 

nr=i/(^l/ i di,c,/c)7r(c)<9(c;c*) 

where Q(c*;c) denote a source density for a candidate draw c* given the current 
value c in the sampled sequence. In this case, Q(c*; c) = 1/c*, and we accept 
the proposal with probability min{l, q}. 

5. Sampling from di can be done directly since we have the probability mass func¬ 
tion, 

P(di= j) =f(yi\Hj,c,n) for j = l,...,N u 
si 

where TV,; = rnax-jj : Ui < £j}. 
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When the mode k is unknown, simply adding an extra step in the algorithm above 
is neither efficient nor viable. The reason for this is that k and the are highly 
correlated. Data values coming from cluster j with fi :i > 0 are typically larger then k, 
while if /i_, < 0, they are typically smaller. The consequence is that for fixed values of 
Hi , any proposals for k are prone to be rejected. 

The explanation for this is the following. Let {j/(i), 2/(2)? • • •, y( n )} be the ordered 
data, and suppose the initial/current value for k, written as «*, in the MCMC algorithm, 
places it between and yy+i), for some l = 1,..., n — 1, as shown in Figure|2] It is 
likely that {t/ (1) ,..., yp)} belong to clusters with negative fij, and {t/p+i),..., 
belong to clusters with positive /z ( . This is true particularly for data which is close 
to k, as from (|8j) we can see that they correspond to large values of These large 
values typically generate positive values of x/z when ji 3 > 0, and negative values of 
x/z when yj < 0. 

Suppose we start the algorithm assigning to the data clusters whose //,^ have a sign 
which is negative for yi smaller than and positive otherwise. If we now propose 
moving the value of k to, say, somewhere between yp+i) and yn +2 ), it means that 
yn+i) will be incompatible with its corresponding cluster. The same happens if we 
move k in the opposite direction. More drastic moves make the problem more acute. 

A solution for this would be to test moves of k and (/i,^ J jointly in a Metropolis- 
Hastings step. That, however, would be very inefficient due to the possible high num¬ 
bers of (jij) involved. Our idea then is to test a new proposal for k jointly with moving 
the affected yi, so the M-H step utilized to sample k, and possibly some of the di, is 
now described. 

To make the proposal, first sample auxiliary variable s from a discrete uniform 
distribution in the interval [—m,m\, for a chosen m € IN + . This sampled value will 
give the size and direction of the move we wish to propose for n. 

As a toy example, suppose we choose to = 2. Therefore, s will be either —2, — 1,0,1 
or 2, with the same probability. If k is placed between order statistics yn^ and yp+i) we 
have that: if s = —2, n* will be sampled uniformly within the interval [t/(;_2)> J/p-i)]; 
if s = —1, k* will be sampled uniformly within the interval [t/(;_i), J/(z)], and so forth, 
as illustrated in Figure [2] 
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Note, however, that near the edges (y^i) and t/(„)) not every one of these movements 
can be made. For instance, in the example above, if l = 2, the movement proposed by 
s = —2 is not possible, since there is no interval [yo,yi\- hi that case, s is sampled 
uniformly between the possible values, which in this case would be s = — 1 , 0,1 and 
2. 

Generalizing the notation, n* will be sampled within the interval [n a ,Kb], where 
n a = max{t/ (ft+s ),?/(!)}, and n b = min {y^ h+s+1) , t/ (n) }, with h representing the 
rank of the order statistics of the y which is immediately ranked lower than k. In our 
toy example, h = l. 

If s = 0, only a new proposal for n will be made, k* ~ U(K a ,n b )', otherwise 
we will also test moving the observations which fall between k and n* to a different 
cluster. If k* > k, we propose moving these observation to the same cluster of y(h)- 
If k* < k, we propose moving them to the same cluster of yih+i ) • In our example, if 
s = 2, it means proposing that d°[h + 1 ] and d°[h + 2] are equal to d°[l], where d°[l] 
represents the cluster to which y^ belongs to. In the general case, we have: 

• if s < 0 , we propose that d°[h + s + 1 ],..., d°[h\ are equal to d°[h + 1 ]; 

• if s > 0 , we propose that d°[h + 1 ], ..., d°[h + s] are equal to d°[h]. 


The Metropolis rate is given by: 

_ p(n*,d*\-) ^ nr=i f(y(i)\Vd pz ,c, K*)Tr(K*)Q{K*,d*;K,d) 
P(n,d\-) * niLi f(.y(i)\yd i ,c,K)ir(K)Q(K,d- 1 K*,d*) 

where 


Q(k* , d*; k, d) 


(min{?r, h + ?n} — max{ 0 , h — m}) 1 
K b - K a 


and, analogously. 


Q(k, d ; k* , d*) 


(min{n, h + s + m} — max{ 0 , h + s — ?n}) 1 
2/O+t) “ 2/0) 


We accept the proposal with probability min{l, q}. 
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2.3 Examples with simulated data 

To test our model and the proposed MCMC algorithm, we simulate two artificial data¬ 
sets, and predict their densities. The idea here is verify if we can efficiently reconstruct 
the densities, which in this case are known. The simulated models are given bellow: 

A. y is sampled from (100,100 2 ); 

B. y is sampled from ga(3, 10). 

Data-sets of size n = 100 were simulated from models A and B. Figure [3] shows 
the histogram of these data-sets and the density of the distribution they were generated 
from. All figures included in this article were produced via the software R ( l24l l. 

The two sets of observations are modeled through[9] with the following specifica¬ 
tions for the parameters of the prior distributions: er^ = 10, tr^ = 10000, a c = /3 C = 
0.1, and a« = /3m = 0.01. 

For each data-set, the algorithm proposed in section 4 was applied to sample from 
the posterior distribution of the model parameters. The MCMC was implemented in 
the software Ox version 7.0 (®), with T = 300000 iterations and a burn in of 10000. 
Due to auto-correlation in the MCMC chains, the samples were recorded at each 100 
iterations, leaving us with samples of size 2900 to perform inference. 

Figure [4] shows the histogram of the sample from the posterior of parameter k 
under data-sets A and B. Note that in both cases the real value of the parameter is well 
estimated, falling close to the mode of the estimated posterior density. 

Figure [5] shows the histograms of the predictive distributions under both data-sets 
compared with their real density. It can be seen that in both cases the predictive distri¬ 
bution is close to the real density, showing the flexibility of the model and validating 
the methodology. 


3 Models for multivariate densities 


This section is divided in two sub-sections. Firstly, in sub-section 3.1 we discuss 
different definitions of multivariate unimodality that can be found in the literature. In 
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sub-section 3.2 we propose a class of multivariate unimodal densities, extending the 
univariate densities of section [2Tl 


3.1 Multivariate unimodality 

The concept of unimodality is well established for univariate distributions, and, fol¬ 
lowing Khintchine’s representation ( || 18| ), it can be defined in different but equivalent 
ways. It is not straightforward, however, to extend the notion of unimodality to higher 
dimensional distributions, as different definitions of unimodality are not equivalent 
when extended. 

The first attempts to define multivariate unimodality were made for symmetric 
distributions only, by III and l28l . Again under symmetry, Q introduced the no¬ 
tion of convex unimodality and monotone unimodality. The authors also defined lin¬ 
ear unimodality, which can be applied to asymmetric distributions. A random vector 
(Yi,..., Y n ) is linear unimodal about 0 if every linear combination a i^i has a 
univariate unimodal distribution about 0. This, however, is not a desired definition, as 
pointed out by (7] themselves, as the density of such may not be a maximum at the 
mode of univariate unimodality. 

Further, ETl define a—unimodality about 0 for a random variable Y in IR" when 
for all real, bounded, nonnegative Borel function g on IR" the function t a E(g(tY)) is 
non-decreasing as t increases in [0, oo), where E denotes the expectation with respect 
to the density of Y. If X has a density function / with respect to the Lebesgue measure 
y n on IR", then A' is a—unimodal if and only if t n ~ a f(ty) is decreasing in t > 0. 
The distribution of a random vector {Y \,..., Y n ) is said to be star unimodal about 0, 
according to Q, if it is n-unimodal in the sense of El- 

According to 0 , a linear unimodal distribution need not be star unimodal, and 
vice versa. Thus there is no implied relationship between star and linear unimodality. 

Another useful definition of unimodality is given by 0 : A multivariate density / 
on \R d is orthounimodal with mode at m = (mi,..., m.d) if for each j, f(y \,..., yd) 
is a decreasing function of yj as yj —> oo for yj > mrij, and as Xj —> — oo for yj < rrij, 
when all other components are held fixed. 
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Orthounimodal densities are widely applicable. They form a robust class in the 
sense that all lower-dimensional marginals of orthounimodal densities are also or¬ 
thounimodal. Also, they are star unimodal, according to 0 , and most bivariate densi¬ 
ties are either orthounimodal, or orthounimodal after a linear transformation. Narrower 
notions than orthounimodality can also be explored, and they are discussed in | 6 |. 

For the two dimensional case only, ll27l generalizes Khintchine’s stating that the 
distribution function of (Y), Y 2 ) is unimodal if and only if it can be written as 


(Y 1 ,Y 2 ) = (X 1 Z 1 ,X 2 Z 2 ), 


( 11 ) 


where (Xi,X 2 ) and (Z-\. Z 2 ) are independent and (Xi,X 2 ) is uniformly distributed 
in [0,1]. For higher dimensions, Khintchine’s representation was generalized by fl7l 
for the symmetric case. He defines symmetric multivariate unimodal distributions on 
]R d as generalized mixtures (in the sense of integrating on a probability measure space) 
of uniform distributions on symmetric, compact and convex sets in IR d . In this paper 


we will use a form of (11 1 while guaranteeing orthounimodality. 

Comprehensive reviews on multivariate unimodality can be found in 2) and (71 . 
Also CD reviews the literature and introduces and discusses a new class of nonpara- 
metric prior distributions for multivariate multimodal distributions. 


3.2 Proposed model 

To extend the univariate unimodality proposed in section |2.1| we start by de¬ 
fining marginal variables Yi, Y 2 ,...,Yd via Y/ = Xi/Zi, where Xi ~ (7(0,1) 
and Zi ~ /q), for l = 1, 2,..., d. The dependence between variables Y = 

(Yi, Y 2 ,..., Yd) can be obtained imposing dependence to either Z = (Z \,..., Z r i) or 
X = (X 1 ,...,X d ), or both. 

Our first attempt was to allow dependence only in Z through a multivariate nor¬ 
mal distribution with a general variance-covariance matrix. This way our construction 
would be following the generalization of Khintchine’s representation made by l27l for 
the bivariate case. The dependence obtained between Y under that construction, how¬ 
ever, for a bivariate setting, is limited. Better results were obtained when imposing 
dependence in X instead, through a Gaussian copula. 
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To demonstrate: Figure [ 6 ] shows the approximate 95% confidence intervals of the 
correlations between Y when varying the correlations between Z (top) and X (bottom) 
respectively, through the interval {—1, —0.9, —0.8,..., 0.9,1}. These confidence in¬ 
tervals were found by the simulation of 100 samples of size 100. The figure was con¬ 
structed considering y = (10,10) and // = (—10,10). Similar results to the ones with 
/i = (10,10) were found for y, = (—10, —10). In the same way, y = (10, —10) and 
/i = (—10,10) present similar results. The magnitude of y did not seem to change 
this output significantly. It can be seen that when imposing dependence through Z, the 
correlation between Y do not vary much beyond the interval [—0.2,0.2], while the cor¬ 
relation between the (T),..., Y d ) and the correlation between the {X\ ,..., X,/) was 
found to be similar. 

Due to the results just described we opted to work with the second construction: 
i.e. Yi = Xt/Zu Z x l = 1 ,..., d, and X = (X u X 2 ,..., X d ) are 

modeled through the Gaussian copula with correlation matrix E. Once again, the mode 
at k = (0,..., 0) can be exchanged, substituting Y for (Y" — k) where k, as well as Y , 
is now a d —dimensional vector. 


Given x, iij = (nij ,..., c = (ci, ..., c d ) and k = (k\, ..., n d ), the obser¬ 
vations y = (j/i, r/2, • • • > Vd) are independent, and f(y\x, y-j,c, k) is given by 

d 

f(y\x,Hj,c,K.) = II ( Xl /yf )gi(xi/yi\fj>ij,ci,Ki). 

i=i 

The marginal model, f(y\yj, c, k), can be obtained through the integral 

n. d 

f(y\Hj,c,K)= / \[(xi/y‘t)gi(xi/y l \y, l j,cum)c(x l ,...,x d \Y), (12) 

tl 

where c(x] ,... ,x d \Y) represents the multivariate Gaussian copula with correlation 
matrix E: 


c(x i,..., tCrf|E) 



f, 

1 $ ^i) ^ 


1 $ x (xi) \ 1 

l 

v/M exp ' 


V $_1 (®d) ) 

(E- 1 - I) 



Note, however, that the dependence created between variables Y is defined not 
only by the correlation matrix E, but also by the sign of the components of /i. As 
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an illustration, we sampled four bivariate data-sets, and examined the dispersion plots 
between y\ and t/ 2 . Figure[7]shows the dispersion plot between data sampled with p-\ = 
3, and either p 2 = 10 or p 2 = —10, and with p = £[1, 2] = 0.5, p = 0.8 or p = —0.8. 
It can be seen that positive dependence is created when (pi x //. 2 x p) is positive, and 
negative dependence is created otherwise. To help the model identification, we assume 
that p £ [0,1]. This restriction does not take away the model flexibility, and negative 
correlations between the variables can be captured by opposite signs in //. 

Let us now define the mixture of these densities, which is the unimodal multivariate 
model of interest. For observation Y = (Yi,..., Yd) the model is: 


f(y) = ^2 w jf(y\H,c,K), 

3=1 


where f(y\pj , c, k) is in (12 1 . Note that, by construction, the proposed distribution is 
multivariate orthounimodal, which is, therefore, also star unimodal. Unlike the uni¬ 


variate case, however, we cannot solve the integral in ( 121 and obtain f(y\pj, c, k) 
analytically. Therefore, we must come up with a different algorithm to solve the sam¬ 
pling of n. As a “bridge” to the multivariate case, this algorithm is first developed for 
the univariate case, and then extended to higher dimensions. The univariate “bridge” 
algorithm is presented in subsection [33] 


3.3 MCMC algorithm for univariate “bridge’ 


To mimic the situation in which the latent variable x cannot be integrated out, we 
present in this section an algorithm which samples from x = (x' i ...., x n ), and samples 
from the other parameters given x. We consider the same prior distributions specified 
in section 2.2 For x we assume a uniform prior at [0,1]". The algorithm goes as 


follows: 


1. Sample u z and obtain N = max{£ 1 {di)} as previously in section 


2.2 


2. Sample M as previously in section [23] 

3. Sample (/j 7 ) for j = 1,..., N as before, but using f( r yi\x l . /j 7 , c, k) (propor¬ 
tional to [ 8 ]) instead of f{y,\pf n c. n) in the Metropolis ratio. This way, the M-H 
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acceptance criterion is given by: 


= nr=l f{yi\ x ii ^di,C*,K)TT(c*) 

Yl r i=\f{yi\ x iiV‘d i ,C,K)'K{c) 

4. The full conditional distribution of c in this case has a closed form, and it is 
updated via Gibbs sampling. Given y, x, // and k, we have: 

( H i n ( Xi \ 2 

c\x,n,K,y~ga -+a c , - V —-- fj, di ) + /3 C 

5. Sample (x, d, k\ /jl, c, y) in two stages: first sample from f(d, k\ /*, c, y) and then 
from f(x\d, k. /z, c, y). d and k are sampled as before, and (&*) for i = 1,..., n, 
is sampled via rejection sampling, as follows: 



sample a proposal Xi ~ /7[0, 1 ]; 

• compute f(xi\c, n di , Vi, k) oc ijexp <( -- 2/j2 


(sfc - w) 


• compute the value x which maximizes the function: 

• f, Vdiidi-K) 

X = mm 1,--- 


{yi - n)JfJ- 2 d ./c + 0.25n di 




accept £i with probability min |l, j, or go back to the first 

step. 


The results obtained under this algorithm were similar to those obtained from the 
previous one, despite being considerably slower. The new algorithm, however, can be 
easily extended to the multivariate case. In this paper we will discuss the algorithm and 
results obtained for the bivariate case, and leave higher dimensionality for future work. 


3.4 MCMC algorithm for bivariate observations 

In this section we present the algorithm developed to handle bivariate observations 
( p = 2). Under this scenario, the unknown quantities are: (/i ; = H 2 : j)-, Vj),j = 

1 . 2 .. ..; ( di,u,i),i = l,...,n; c = (ci,c 2 ); M; k = (ki,k 2 ), %u,l = 1 ,2,* = 

1.. .., n, and also the correlation parameter in the bivariate Gaussian Copula ((>). As 
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pointed out in Section 3.2 without compromising the model flexibility, we can assume 
that p £ [0,1]. Therefore we assigned a U[0, 1] prior for this parameter. 

The algorithm proposed for the bivariate case is given below: 


1. Sample Ui and obtain N = max{£ - 1 (<ii)} as previously; 

2. Sample M as previously; 

3. Sample (pij) and independently for j = 1 ,N, following the same 

idea as in ( |3.3| ); 

4. Sample C\ and C 2 via Gibbs sampling. Given y, x, //, and k, we have: 

ci\x,h,k ~ ga +a c , K ~ + ’ 1 = lj2 ' 

5. Sample ki and k-> independently, using a similar algorithm as previously. Note 
that every time k \ and k-> are updated, some elements of d might also be changed. 

6 . Sample d, directly from its probability mass function, 

vo ■ 

P(di = j) = - l f{yii\^ij,c,K, 1 )f(y 2 i\/J J 2 j,c^ 2 )iorj = 

si 

where TV, = max{j : itj < £j}. 

7. Sample (j;i,a; 2 ) via rejection sampling. Here we consider two possible algo¬ 
rithms: 


7.1. 


• Sample from the copula C(x* i , x^)'- 


1 P 
P 1 


• compute 

n dli ,y u ,P-d 2i ,y2i, k) oc JT^exp < -- 

j = 1 






P’ji 
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• compute the values xi and x' 2 which maximize the function above: 



• accept with probability 


min 



or go back to the first step. 


7.2. 


• Sample x* { from the function 


f(x*i,xl i \c,n dl .,yu,fJ'd2i,y2i,K) oc a^exp 



through adaptive rejection sampling (Q4|); 

• compute c(xli, x^); 

• compute the values X\ and £2 which maximize the copula: x 1 = px 2 and 

X\ = pX\, 

• accept (x*ijX^i) with probability 



or go back to the first step. 

The first algorithm proposed to sample from x leads to faster convergence. How¬ 
ever, it can be very slow at times, requiring a large amount of simulations until 
acceptance. We combined the two algorithms in the following way: Sample from 
7.1 until it either accepts the proposal or reaches a set limited number of trials. 
If the latter occurs, sample from x through algorithm 7.2. 

8. Sample p via Metropolis-Hastings. The proposal p* is obtained from: 
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for a suitable choice of a 2 p . 

The M-H acceptance criterion is given by: 

= f{p*\xi,x 2 )Q(p*;p) 
q f(p\xi,x 2 )Q(p;p*) ’ 

where Q(p*;p) = l/(p*(l — p*)). We accept the proposal with probability 
inin{l, q). 


3.5 Examples with simulated data 


In this section we present the results obtained for a simulated bivariate data-set of size 
n = 100, from a bivariate Normal distribution: y ~ N(k, f2) with k = (30, 60), and 
1 Py 


Q = 10 


Py 


1 


, with p y = 0.5. 


The algorithm proposed in section 3.4 was applied to sample from the posterior 
distribution of the model parameters and obtain samples of the predictive distributions. 
The algorithm was run for T = 75000 iterations, with a burn in of 5000. Due to auto¬ 
correlation in the MCMC chains, samples were recorded at each 50 iterations, leaving 
a sample of size 1400 to perform inference. 

Figure [ 8 ] shows the histograms of the posterior densities of the components of pa¬ 
rameter k. It can be seen that this parameter was reasonably well estimated, with the 
true value being close to the posterior mode. Figure |9]presents a comparison between 
the histograms of the simulated samples of y-\ and y 2 , and the ones predicted through 
the proposed MCMC algorithm, showing that the predictions seem close to what was 
expected. 

We also wish to verify if the dependence between y \ and y 2 was preserved in the 
predictions. To do this analysis, we computed the correlation at every 100 samples of 
the predicted (yi,y 2 ). This time we used the full T = 70000 sampled values, ending 
up with a sample of size 700 correlations. Figure [TO] compares the histogram of these 
correlations, which can be seen as a proxy of the posterior distribution of p y , to the real 
values (in red), showing good predictions. 


20 




4 Boston Housing Data 


As an application, we work with a part of the Boston Housing data-set created by 03 
and taken from l20l . This database comprises 506 cases of 12 variables concerning 
housing values in the suburbs of Boston, and it has been used in many applications, 
especially in regression and machine learning, such as El and E2- We chose to work 
with two variables of the database: nitric oxides concentration (parts per 10 million) 
at the household (NOX) and weighted distances to five Boston employment centers 
(DIS). Exploratory analysis points towards the unimodality of the joint density of NOX 
and DIS, as can be seen by the contour plot displayed in Figure |TT] This contour 
plot also shows a negative, unusually shaped, dependence between these variables. 
Histograms of observations NOX and DIS are also presented in Figure El showing 
that both variables have a non-normal unimodal shape. 

Our objective in this application is to illustrate the flexibility of our approach in 
capturing the form of this bivariate distribution and the oddly shaped dependence be¬ 
tween NOX and DIS. We worked with a sub-set of n = 100 observations and applied 
the proposed bivariate model with the same priors used for the toy examples. Again, 
the algorithm was run for T = 75000 iterations, with a burn in of 5000, and recorded 
at every 50 iteration, totalizing a sample of size 1400 for inference. 

Figure [l2| shows the histograms of the predictive density of NOX and DIS, and the 
contour plot made with 500 points of the predictive distribution. We observe a high 
similarity between the real and the predicted histograms and dispersion plots, and can 
conclude that the proposed methodology was able to capture well the behavior of the 
data for this example. 

Note that our approach also allows for predictions of one variable given the other 
after samples of both had been previously observed. As a second exercise, we analyze 
the predictive capacity of our model, compared to a more usual linear alternative. If 
the objective is to predict nitric oxides concentration based on weighted distances to 
employment centers, a regression model would probably be considered. As can be seen 
in Figure |TT] however, the data needs transformation for the normal linear regression 
model assumptions to hold adequately. After exploratory analysis, we propose the 
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following regression model: 


NOX, 1 = a + /31og(DISj) + e*, i = l,...,n, 
ei ~ 7V(0,crg), 

with vague Normal priors for a and p, and a vague inverse Wishart prior for ap 

Figure [13] shows the box-plot of the response variable NOX -1 and the dispersion 
plot between NOX” 1 and log(DIS), showing a linear dependence between both vari¬ 
ables. The simple regression was fit using software OpenBugs 3.2.3 ( 11301 ). 

A comparison is then performed between predictions of 10 values of NOX given 
DIS, after observing n = 100 cases of both DIS and NOX, under both models. Fig¬ 
ure [14] presents the 95% credible intervals and posterior medians under the proposed 
nonparametric model and the simple regression model, being compared with the real 
values. Figure [represents the histograms of the posterior distribution of the first three 
predicted values. As is typical, nonparametric methods produce larger credible inter¬ 
vals than the parametric counterparts, yet are typically centered in the true values. The 
parametric versions however can simply be wrong. 

5 Final remarks 

In this paper we have proposed a new class of unimodal densities whose support include 
all the unimodal densities on the real line. Our motivation for this is that a mixture of 
the proposed univariate model can be adequate to cluster data, with each cluster being 
represented by a unimodal density. 

One of our objectives was also to be able to create a class of unimodal densities that 
could be naturally extended to the multivariate case. Our models allow this through the 
use of the multivariate Gaussian copula. This way, modeling multivariate clusters can 
also benefit from the methodology developed in this paper. 

The proposed models, however, cannot be dealt with analytically. We have also 
proposed an MCMC algorithm to obtain samples of the posterior distribution of the 
model parameters and to obtain predictive densities. The methodology was illustrated 
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with univariate and bivariate examples, and with two variables taken from the Boston 
Housing Data ( lfl5l ). 

Modeling multivariate unimodal densities with dimension higher than two can be 
easily done with a slight modification to the code presented in section [3~4| In that 
case, instead of sampling from a single correlation parameter p , we must sample from 
a correlation matrix, which can be done following (32]. Preliminary results showed 
this to be effective for a three-variate model. For future work we must test the code 
extensively for three or more dimensions, and finally do clustering for an arbitrary 
dimension d, through the mixture of the densities proposed in this paper. 
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Figure 1: Unimodal density function © with a mixture of two components, n = 10, 

fi = (-5, 20), w = (0.8, 0.2). 


y<i-2> y ( M) v,D i y<i+i) y<i+2) y<i+3) 

H -+-+ 1 

s = -2 s = -1 K s = 1 s = 2 

s = 0 

Figure 2: Toy example of initial value for k and possible values of s in the MCMC 
algorithm. 




Figure 3: Histograms based on n = 100 points sampled from models A and B. The 
black line represents their respective densities. 
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Figure 4: Histograms based on 2900 samples from the posterior of parameter k under 
observations from models A and B. Red lines indicate the true values of the parameter. 



-200 O 200 400 



0 0 0 2 0 4 0 0 0 8 1 0 1.2 


Figure 5: Histograms based on 2900 samples from the predictive distribution under 
observations from models A and B. Black line indicates the true density. 
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Figure 6: 95% confidence intervals of the correlation obtained between ( Y\. Y - 2 ) when 
varying the correlation between (Z\. Z?) (top) and (X ]. Xz) (bottom) respectively, 
through the interval {—1, —0.9, —0.8,..., 0.9,1}. 
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Figure 7: Dispersion Plot of samples y\ vs y-z of size 1000 from the simulated bivariate 
example. 
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Figure 8: Histograms of the posterior of k under simulated bivariate model. Red lines 
indicate true values. 


Y1 




Predicted Y1 


Predicted Y2 




Figure 9: Histograms comparing the original simulated samples (size 100) to the pre¬ 
dicted samples under the bivariate model. Black lines represent the true density. 
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Figure 10: Histogram of the estimated correlation between samples of size 100 of 
predicted ( 2/1 , J/ 2 )- The red line indicates the true correlation. 



Figure 11: Histograms of variables DIS and NOX, and contour plot of DIS vs NOX 
from the Boston Housing database. 





Figure 12: Predicted histograms of DIS and NOX, and predicted contour plot of DIS 
vs NOX under the proposed bivariate mixture model. 
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Figure 13: Histogram of NOX 1 and dispersion plot of log(DIS) vs NOX 1 with 
regression line. 
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Figure 14: 95% credible intervals and posterior medians of 10 “future” values of NOX 
being compared to their observed values (red dots) under the proposed nonparametric 
model (top) and a regression model (bottom). 
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Figure 15: histograms of the posterior of 3 “future” values of NOX under the proposed 
nonparametric model (top) and a regression model (bottom). Real observed values are 
indicated by red line. 
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